Data analyst vs Data scientist
“A data scientist is someone who can predict the future based on past patterns whereas a data analyst is someone who merely curates meaningful insights from data.”
“A data scientist is expected to generate their own questions while a data analyst finds answers to a given set of questions from data.” (https://www.projectpro.io/article/difference-between-data-analyst-and-data-scientist/332)
Įvadas
Darbui duomenys buvo paimti iš Kaggle puslapio:
https://www.kaggle.com/andrewmvd/data-scientist-jobs
https://www.kaggle.com/andrewmvd/data-analyst-jobs
Duomenų bazės atvaizduoja 2020-07 mėn. darbo skelbimus į duomenų mokslininko ir duomenų analitiko pozicijas iš JAV puslapio:
https://www.glassdoor.com/member/home/index.htm
Darbo tikslai:
1. Palyginti specialistų atlyginimus.
2. Kompiuterinių įgūdžių skirtumai/panašumai.
3. Kokiose darbo sektoriuose reikalingi šie specialistai? Atlyginimai pagal sektorius. TOP10 įmonių ir TOP10 miestų.
# Importuojame reikalingus modulius
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
analyst = pd.read_csv('C:\\...\\DATA_specialists\\DataAnalyst.csv')
analyst.head(3)
Unnamed: 0 | Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice\n3.2 | New York, NY | New York, NY | 201 to 500 employees | 1961 | Nonprofit Organization | Social Assistance | Non-Profit | $100 to $500 million (USD) | -1 | TRUE |
1 | 1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York\n3.8 | New York, NY | New York, NY | 10000+ employees | 1893 | Nonprofit Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | -1 | -1 |
2 | 2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace\n3.4 | New York, NY | New York, NY | 1001 to 5000 employees | 2003 | Company - Private | Internet | Information Technology | Unknown / Non-Applicable | GoDaddy | -1 |
analyst.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2253 entries, 0 to 2252 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 2253 non-null int64 1 Job Title 2253 non-null object 2 Salary Estimate 2253 non-null object 3 Job Description 2253 non-null object 4 Rating 2253 non-null float64 5 Company Name 2252 non-null object 6 Location 2253 non-null object 7 Headquarters 2253 non-null object 8 Size 2253 non-null object 9 Founded 2253 non-null int64 10 Type of ownership 2253 non-null object 11 Industry 2253 non-null object 12 Sector 2253 non-null object 13 Revenue 2253 non-null object 14 Competitors 2253 non-null object 15 Easy Apply 2253 non-null object dtypes: float64(1), int64(2), object(13) memory usage: 281.8+ KB
scientist = pd.read_csv('C:\\...\\DATA_specialists\\DataScientist.csv')
scientist.head(3)
Unnamed: 0 | Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Senior Data Scientist | $111K-$181K (Glassdoor est.) | ABOUT HOPPER\n\nAt Hopper, we’re on a mission ... | 3.5 | Hopper\n3.5 | New York, NY | Montreal, Canada | 501 to 1000 employees | 2007 | Company - Private | Travel Agencies | Travel & Tourism | Unknown / Non-Applicable | -1 | -1 |
1 | 1 | Data Scientist, Product Analytics | $111K-$181K (Glassdoor est.) | At Noom, we use scientifically proven methods ... | 4.5 | Noom US\n4.5 | New York, NY | New York, NY | 1001 to 5000 employees | 2008 | Company - Private | Health, Beauty, & Fitness | Consumer Services | Unknown / Non-Applicable | -1 | -1 |
2 | 2 | Data Science Manager | $111K-$181K (Glassdoor est.) | Decode_M\n\nhttps://www.decode-m.com/\n\nData ... | -1.0 | Decode_M | New York, NY | New York, NY | 1 to 50 employees | -1 | Unknown | -1 | -1 | Unknown / Non-Applicable | -1 | TRUE |
scientist.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3909 entries, 0 to 3908 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 3909 non-null int64 1 Job Title 3909 non-null object 2 Salary Estimate 3909 non-null object 3 Job Description 3909 non-null object 4 Rating 3909 non-null float64 5 Company Name 3909 non-null object 6 Location 3909 non-null object 7 Headquarters 3909 non-null object 8 Size 3909 non-null object 9 Founded 3909 non-null int64 10 Type of ownership 3909 non-null object 11 Industry 3909 non-null object 12 Sector 3909 non-null object 13 Revenue 3909 non-null object 14 Competitors 3909 non-null object 15 Easy Apply 3909 non-null object dtypes: float64(1), int64(2), object(13) memory usage: 488.8+ KB
Po greitosios duomenų apžvalgos, matome, kad
# Kadangi įkeliant duomenis lentelėse atsirado indeksų dublikatai, pašalinkime po vieną stulpelį kiekvienoje lentelėje
analyst.drop(analyst.columns[[0]], axis=1, inplace=True)
analyst.head(3)
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice\n3.2 | New York, NY | New York, NY | 201 to 500 employees | 1961 | Nonprofit Organization | Social Assistance | Non-Profit | $100 to $500 million (USD) | -1 | TRUE |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York\n3.8 | New York, NY | New York, NY | 10000+ employees | 1893 | Nonprofit Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | -1 | -1 |
2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace\n3.4 | New York, NY | New York, NY | 1001 to 5000 employees | 2003 | Company - Private | Internet | Information Technology | Unknown / Non-Applicable | GoDaddy | -1 |
scientist.drop(scientist.columns[[0]], axis=1, inplace=True)
scientist.head(3)
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Senior Data Scientist | $111K-$181K (Glassdoor est.) | ABOUT HOPPER\n\nAt Hopper, we’re on a mission ... | 3.5 | Hopper\n3.5 | New York, NY | Montreal, Canada | 501 to 1000 employees | 2007 | Company - Private | Travel Agencies | Travel & Tourism | Unknown / Non-Applicable | -1 | -1 |
1 | Data Scientist, Product Analytics | $111K-$181K (Glassdoor est.) | At Noom, we use scientifically proven methods ... | 4.5 | Noom US\n4.5 | New York, NY | New York, NY | 1001 to 5000 employees | 2008 | Company - Private | Health, Beauty, & Fitness | Consumer Services | Unknown / Non-Applicable | -1 | -1 |
2 | Data Science Manager | $111K-$181K (Glassdoor est.) | Decode_M\n\nhttps://www.decode-m.com/\n\nData ... | -1.0 | Decode_M | New York, NY | New York, NY | 1 to 50 employees | -1 | Unknown | -1 | -1 | Unknown / Non-Applicable | -1 | TRUE |
# Apjunkime lenteles concat f-jos pagalba į bendrą lentelę 'specialists' ir sukūrkime naują stulpelį 'specialist', kuriame
# priskirkime duomenims iš lentelės 'analyst' reikšmes 'Analyst', o iš lentelės 'scientist' reikšmes 'Scientist'
analyst['specialist'] = 'Analyst'
scientist['specialist'] = 'Scientist'
specialists = pd.concat([analyst, scientist], ignore_index=True)
specialists
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | specialist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice\n3.2 | New York, NY | New York, NY | 201 to 500 employees | 1961 | Nonprofit Organization | Social Assistance | Non-Profit | $100 to $500 million (USD) | -1 | TRUE | Analyst |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York\n3.8 | New York, NY | New York, NY | 10000+ employees | 1893 | Nonprofit Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | -1 | -1 | Analyst |
2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace\n3.4 | New York, NY | New York, NY | 1001 to 5000 employees | 2003 | Company - Private | Internet | Information Technology | Unknown / Non-Applicable | GoDaddy | -1 | Analyst |
3 | Data Analyst | $37K-$66K (Glassdoor est.) | Requisition NumberRR-0001939\nRemote:Yes\nWe c... | 4.1 | Celerity\n4.1 | New York, NY | McLean, VA | 201 to 500 employees | 2002 | Subsidiary or Business Segment | IT Services | Information Technology | $50 to $100 million (USD) | -1 | -1 | Analyst |
4 | Reporting Data Analyst | $37K-$66K (Glassdoor est.) | ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... | 3.9 | FanDuel\n3.9 | New York, NY | New York, NY | 501 to 1000 employees | 2009 | Company - Private | Sports & Recreation | Arts, Entertainment & Recreation | $100 to $500 million (USD) | DraftKings | TRUE | Analyst |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6157 | AWS Data Engineer | $55K-$112K (Glassdoor est.) | About Us\n\nTachyon Technologies is a Digital ... | 4.4 | Tachyon Technologies\n4.4 | Dublin, OH | Irving, TX | 201 to 500 employees | 2011 | Company - Private | IT Services | Information Technology | $10 to $25 million (USD) | -1 | -1 | Scientist |
6158 | Data Analyst â Junior | $55K-$112K (Glassdoor est.) | Job description\nInterpret data, analyze resul... | 5.0 | Staffigo Technical Services, LLC\n5.0 | Columbus, OH | Woodridge, IL | 51 to 200 employees | 2008 | Company - Private | IT Services | Information Technology | $50 to $100 million (USD) | -1 | -1 | Scientist |
6159 | Security Analytics Data Engineer | $55K-$112K (Glassdoor est.) | Job DescriptionThe Security Analytics Data Eng... | 3.8 | PDS Tech, Inc.\n3.8 | Dublin, OH | Irving, TX | 5001 to 10000 employees | 1977 | Company - Private | Staffing & Outsourcing | Business Services | $100 to $500 million (USD) | -1 | -1 | Scientist |
6160 | Security Analytics Data Engineer | $55K-$112K (Glassdoor est.) | The Security Analytics Data Engineer will inte... | 4.0 | Data Resource Technologies\n4.0 | Dublin, OH | Omaha, NE | 1 to 50 employees | -1 | Company - Private | Accounting | Accounting & Legal | Less than $1 million (USD) | -1 | -1 | Scientist |
6161 | Patient Safety Physician or Safety Scientist -... | $55K-$112K (Glassdoor est.) | Help us transform patients' lives.\nAt UCB, we... | 3.7 | UCB\n3.7 | Slough, OH | Brussel, Belgium | 5001 to 10000 employees | -1 | Company - Public | Biotech & Pharmaceuticals | Biotech & Pharmaceuticals | $2 to $5 billion (USD) | AbbVie | -1 | Scientist |
6162 rows × 16 columns
Apjungus lenetels apdorokime duomenis sekančia tvarka:
specialists = specialists.replace(['-1', '-1.0', -1], np.nan)
specialists.head()
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | specialist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice\n3.2 | New York, NY | New York, NY | 201 to 500 employees | 1961.0 | Nonprofit Organization | Social Assistance | Non-Profit | $100 to $500 million (USD) | NaN | TRUE | Analyst |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York\n3.8 | New York, NY | New York, NY | 10000+ employees | 1893.0 | Nonprofit Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | NaN | NaN | Analyst |
2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace\n3.4 | New York, NY | New York, NY | 1001 to 5000 employees | 2003.0 | Company - Private | Internet | Information Technology | Unknown / Non-Applicable | GoDaddy | NaN | Analyst |
3 | Data Analyst | $37K-$66K (Glassdoor est.) | Requisition NumberRR-0001939\nRemote:Yes\nWe c... | 4.1 | Celerity\n4.1 | New York, NY | McLean, VA | 201 to 500 employees | 2002.0 | Subsidiary or Business Segment | IT Services | Information Technology | $50 to $100 million (USD) | NaN | NaN | Analyst |
4 | Reporting Data Analyst | $37K-$66K (Glassdoor est.) | ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... | 3.9 | FanDuel\n3.9 | New York, NY | New York, NY | 501 to 1000 employees | 2009.0 | Company - Private | Sports & Recreation | Arts, Entertainment & Recreation | $100 to $500 million (USD) | DraftKings | TRUE | Analyst |
# Pažiūrėkime kokie yra skirtingi atlyginimo rėžiai ir jų kiekiai
specialists['Salary Estimate'].value_counts()
$93K-$151K (Glassdoor est.) 62 $60K-$124K (Glassdoor est.) 62 $113K-$180K (Glassdoor est.) 61 $43K-$76K (Glassdoor est.) 61 $71K-$122K (Glassdoor est.) 60 .. $10-$26 Per Hour(Glassdoor est.) 3 $36K-$67K (Glassdoor est.) 3 $43K-$77K (Glassdoor est.) 3 $57K-$70K (Glassdoor est.) 2 $41K-$74K (Glassdoor est.) 1 Name: Salary Estimate, Length: 231, dtype: int64
# Patikrinkime ar yra NaN reikšmių šiame stulpelyje
specialists['Salary Estimate'].isna().sum()
1
# Kadangi stulpelyje 'Salary Estimate' turime tik vieną NaN reikšmę, ištriname eilutę, turinčią šią reikšmę
specialists.dropna(subset=['Salary Estimate'], inplace=True)
# Patikrinkime ar ištrinėme
specialists['Salary Estimate'].isna().sum()
0
# Sukūrkime naująjį kintamąjį iš 'Salary Estimate', kuriame ištriname skliaustelius, įrašus juose, raidę K ir simbolį '$'
salary = specialists['Salary Estimate'].apply(lambda x : x.split('(Glassdoor est.)')[0]).apply(lambda x : x.replace('K','').replace('$','').replace('Per Hour','').replace('(Employer est.)', ''))
salary
0 37-66 1 37-66 2 37-66 3 37-66 4 37-66 ... 6157 55-112 6158 55-112 6159 55-112 6160 55-112 6161 55-112 Name: Salary Estimate, Length: 6161, dtype: object
# Sukūrkime stulpelius su minimalaus, maksimalaus ir vidutinio atlyginimo reikšmėmis
specialists['salary_min'] = salary.apply(lambda x : int(x.split('-')[0]))
specialists['salary_max'] = salary.apply(lambda x : int(x.split('-')[1]))
specialists['salary_avg'] = (specialists.salary_max + specialists.salary_min) / 2
# Patikriname ar susikūrė nauji stulpeliai
specialists.head(3)
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Easy Apply | specialist | salary_min | salary_max | salary_avg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice\n3.2 | New York, NY | New York, NY | 201 to 500 employees | 1961.0 | Nonprofit Organization | Social Assistance | Non-Profit | $100 to $500 million (USD) | NaN | TRUE | Analyst | 37 | 66 | 51.5 |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York\n3.8 | New York, NY | New York, NY | 10000+ employees | 1893.0 | Nonprofit Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | NaN | NaN | Analyst | 37 | 66 | 51.5 |
2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace\n3.4 | New York, NY | New York, NY | 1001 to 5000 employees | 2003.0 | Company - Private | Internet | Information Technology | Unknown / Non-Applicable | GoDaddy | NaN | Analyst | 37 | 66 | 51.5 |
Iš interneto platybių Skills DA and DS https://bootcamp.cvn.columbia.edu/blog/data-analyst-skills/ Below, we've listed the top 11 technical and soft skills required to become a data analyst: Data Visualization Tableau Power BI Data Cleaning MATLAB R Python SQL and NoSQL Machine Learning Linear Algebra and Calculus Microsoft Excel Critical Thinking Communication
https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-article Statistical analysis and computing Machine Learning Deep Learning Processing large data sets Data Visualization Tableau Power BI Data Wrangling Mathematics R Python Statistics
# Sukūrkime naujus stulpelius su įgūdžiais, kur Įgūdžio stulpelio eilutėje bus įrašas 1 jeigu yra 'Job Description' stulpelyje, priešingu atveju
# bus įrašytas 0
specialists['Python'] = specialists['Job Description'].apply(lambda x : 1 if 'python' in x.lower() else 0)
specialists['SQL'] = specialists['Job Description'].apply(lambda x : 1 if 'SQL' in x.lower() or 'SQL' in x else 0)
specialists['R'] = specialists['Job Description'].apply(lambda x : 1 if 'R ' in x else 0)
specialists['SAS'] = specialists['Job Description'].apply(lambda x : 1 if 'SAS' in x else 0)
specialists['Java'] = specialists['Job Description'].apply(lambda x : 1 if 'java' in x.lower() else 0)
specialists['SPSS'] = specialists['Job Description'].apply(lambda x : 1 if 'spss' in x.lower() else 0)
specialists['Excel'] = specialists['Job Description'].apply(lambda x : 1 if 'excel' in x.lower() else 0)
specialists['Tableau'] = specialists['Job Description'].apply(lambda x : 1 if 'tableau' in x.lower() else 0)
specialists['PowerBI'] = specialists['Job Description'].apply(lambda x : 1 if 'powerbi' in x.lower() or 'power bi' in x.lower() else 0)
specialists['Machine Learning'] = specialists['Job Description'].apply(lambda x : 1 if 'machine learning' in x.lower() or 'ML' in x else 0)
specialists['Statistics'] = specialists['Job Description'].apply(lambda x : 1 if 'statistics' in x.lower() or 'stats' in x.lower() or 'statistical' in x.lower() else 0)
specialists[['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].head(3)
Python | R | SAS | SQL | Java | SPSS | Excel | Tableau | PowerBI | Machine Learning | Statistics | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
# Pažiūrėkime kiek radome skirtingų įgūdžių stulpelyje 'Job Description'
skills = {'Python' : specialists['Python'].sum(),
'SQL' : specialists['SQL'].sum(),
'R' : specialists['R'].sum(),
'SAS' : specialists['SAS'].sum(),
'Java' : specialists['Java'].sum(),
'Excel' : specialists['Excel'].sum(),
'SPSS' : specialists['SPSS'].sum(),
'Tableau' : specialists['Tableau'].sum(),
'PowerBI' : specialists['PowerBI'].sum(),
'Machine Learning' : specialists['Machine Learning'].sum(),
'Statistics' : specialists['Statistics'].sum()}
skills
{'Python': 2693, 'SQL': 3392, 'R': 1251, 'SAS': 803, 'Java': 996, 'Excel': 3207, 'SPSS': 294, 'Tableau': 1292, 'PowerBI': 534, 'Machine Learning': 1880, 'Statistics': 2638}
specialists.groupby('specialist')['Python'].value_counts()
specialist Python Analyst 0 1616 1 636 Scientist 1 2057 0 1852 Name: Python, dtype: int64
specialists.groupby('specialist')['SQL'].value_counts()
specialist SQL Analyst 1 1378 0 874 Scientist 1 2014 0 1895 Name: SQL, dtype: int64
# Ištrinkime simbolius stulpelyje 'Company Name' nuo '\n'
specialists['Company Name'] = specialists['Company Name'].apply(lambda x: str(x)[:-4])
specialists.head(3)
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | ... | SQL | R | SAS | Java | SPSS | Excel | Tableau | PowerBI | Machine Learning | Statistics | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice | New York, NY | New York, NY | 201 to 500 employees | 1961.0 | Nonprofit Organization | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York | New York, NY | New York, NY | 10000+ employees | 1893.0 | Nonprofit Organization | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace | New York, NY | New York, NY | 1001 to 5000 employees | 2003.0 | Company - Private | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
3 rows × 30 columns
# Ištraukime valstijų/valstybių santrumpą iš stulpelio 'Location', įrašydami jas į naują stulpelį
# Taip pat pažiūrėkime visas unikalias reikšmes
specialists['Location_state'] = specialists['Location'].apply(lambda x : x.split(',')[1])
specialists['Location_state'].unique()
array([' NY', ' NJ', ' UT', ' CA', ' VA', ' FL', ' IL', ' TX', ' AZ', ' PA', ' DE', ' OH', ' NC', ' SC', ' IN', ' WA', ' GA', ' KS', ' CO', ' Arapahoe', ' United Kingdom'], dtype=object)
# Pakeiskime 'Arapahoe' iš 'Location_state' į 'CO', 'United Kingdom' į 'UK'
specialists['Location_state'] = specialists['Location_state'].replace('Arapahoe', 'CO', regex=True)
specialists['Location_state'] = specialists['Location_state'].replace('United Kingdom', 'UK', regex=True)
specialists['Location_state'].unique()
array([' NY', ' NJ', ' UT', ' CA', ' VA', ' FL', ' IL', ' TX', ' AZ', ' PA', ' DE', ' OH', ' NC', ' SC', ' IN', ' WA', ' GA', ' KS', ' CO', ' UK'], dtype=object)
specialists.head(3)
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | ... | R | SAS | Java | SPSS | Excel | Tableau | PowerBI | Machine Learning | Statistics | Location_state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice | New York, NY | New York, NY | 201 to 500 employees | 1961.0 | Nonprofit Organization | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NY |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York | New York, NY | New York, NY | 10000+ employees | 1893.0 | Nonprofit Organization | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | NY |
2 | Senior Data Analyst, Insights & Analytics Team... | $37K-$66K (Glassdoor est.) | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace | New York, NY | New York, NY | 1001 to 5000 employees | 2003.0 | Company - Private | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | NY |
3 rows × 31 columns
# Kadangi jau turime valstijų/valstybių santrumpų atsikrą stulpelį, tai ištrinkime santrumpas stulpelyje 'Location'
specialists['Location'] = specialists['Location'].apply(lambda x: str(x)[:-4])
specialists.head(2)
Job Title | Salary Estimate | Job Description | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | ... | R | SAS | Java | SPSS | Excel | Tableau | PowerBI | Machine Learning | Statistics | Location_state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | $37K-$66K (Glassdoor est.) | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice | New York | New York, NY | 201 to 500 employees | 1961.0 | Nonprofit Organization | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NY |
1 | Quality Data Analyst | $37K-$66K (Glassdoor est.) | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York | New York | New York, NY | 10000+ employees | 1893.0 | Nonprofit Organization | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | NY |
2 rows × 31 columns
# Patikrinkime kiek kiekviename lentelės stulpelyje yra NaN reikšmių
specialists.isna().sum()
Job Title 0 Salary Estimate 0 Job Description 0 Rating 681 Company Name 0 Location 0 Headquarters 412 Size 392 Founded 1637 Type of ownership 392 Industry 899 Sector 899 Revenue 392 Competitors 4491 Easy Apply 5917 specialist 0 salary_min 0 salary_max 0 salary_avg 0 Python 0 SQL 0 R 0 SAS 0 Java 0 SPSS 0 Excel 0 Tableau 0 PowerBI 0 Machine Learning 0 Statistics 0 Location_state 0 dtype: int64
# Ištriname stulpelius, kuriuose trūksta daug reikšmių, bei kurių nenaudosime analizėje
specialists = specialists.drop(['Salary Estimate', 'Founded', 'Competitors', 'Easy Apply'], axis = 1)
specialists.head(3)
Job Title | Job Description | Rating | Company Name | Location | Headquarters | Size | Type of ownership | Industry | Sector | ... | R | SAS | Java | SPSS | Excel | Tableau | PowerBI | Machine Learning | Statistics | Location_state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Data Analyst, Center on Immigration and Justic... | Are you eager to roll up your sleeves and harn... | 3.2 | Vera Institute of Justice | New York | New York, NY | 201 to 500 employees | Nonprofit Organization | Social Assistance | Non-Profit | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NY |
1 | Quality Data Analyst | Overview\n\nProvides analytical and technical ... | 3.8 | Visiting Nurse Service of New York | New York | New York, NY | 10000+ employees | Nonprofit Organization | Health Care Services & Hospitals | Health Care | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | NY |
2 | Senior Data Analyst, Insights & Analytics Team... | We’re looking for a Senior Data Analyst who ha... | 3.4 | Squarespace | New York | New York, NY | 1001 to 5000 employees | Company - Private | Internet | Information Technology | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | NY |
3 rows × 27 columns
specialists.shape
(6161, 27)
specialists[['salary_avg']].describe()
salary_avg | |
---|---|
count | 6161.000000 |
mean | 94.802143 |
std | 38.040594 |
min | 18.000000 |
25% | 64.000000 |
50% | 85.500000 |
75% | 122.000000 |
max | 225.000000 |
# Atlyginimų pasiskirstymas (Histograma) imant abį specialybes
plt.figure(figsize=(4,2))
axes = sns.histplot(x='salary_avg', data = specialists, bins = 20, kde=True)
axes.axvline(specialists['salary_avg'].mean(), color='r', label = 'mean')
axes.axvline(specialists['salary_avg'].median(), color = 'g', label = 'median')
plt.xlabel('Average Salary')
axes.legend()
plt.show()
# Dabar pažiūrėkime atskirai pagal specialybes atlyginimų pasiskirstymus
specialists.groupby('specialist')[['salary_avg']].describe().transpose()
specialist | Analyst | Scientist | |
---|---|---|---|
salary_avg | count | 2252.000000 | 3909.000000 |
mean | 72.123002 | 107.867741 | |
std | 23.600734 | 38.639390 | |
min | 33.500000 | 18.000000 | |
25% | 58.000000 | 73.000000 | |
50% | 69.000000 | 104.500000 | |
75% | 80.500000 | 133.000000 | |
max | 150.000000 | 225.000000 |
specialists['specialist'].value_counts()
Scientist 3909 Analyst 2252 Name: specialist, dtype: int64
# Data analyst
plt.figure(figsize=(4,2))
axes_a = sns.histplot(x='salary_avg', data = specialists.loc[:2251], bins = 20, kde=True)
axes_a.axvline(specialists['salary_avg'].mean(), color='r', label = 'mean')
axes_a.axvline(specialists['salary_avg'].median(), color = 'g', label = 'median')
plt.xlabel('Analyst_Average Salary')
axes_a.legend()
# Data scientist
plt.figure(figsize=(4,2))
axes_s = sns.histplot(x='salary_avg', data = specialists.loc[2252:], bins = 20, kde=True)
axes_s.axvline(specialists['salary_avg'].mean(), color='r', label = 'mean')
axes_s.axvline(specialists['salary_avg'].median(), color = 'g', label = 'median')
plt.xlabel('Scientist_Average Salary')
axes_s.legend()
plt.show()
Iš gautų grafikų matome, reikšmių pasiskirstymai išsikreipę į kairę pusę, t.y. dominuoja mažesni už vidurkį atlyginimai labiau analitiko pozicijoje, negu mokslininko
# Pažiūrėkime kokius įgūžius ir jų kiekius turime pagal specialistus atskirai
specialists.groupby('specialist')[['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].sum()
Python | R | SAS | SQL | Java | SPSS | Excel | Tableau | PowerBI | Machine Learning | Statistics | |
---|---|---|---|---|---|---|---|---|---|---|---|
specialist | |||||||||||
Analyst | 636 | 400 | 359 | 1378 | 204 | 154 | 1353 | 619 | 248 | 328 | 844 |
Scientist | 2057 | 851 | 444 | 2014 | 792 | 140 | 1854 | 673 | 286 | 1552 | 1794 |
Iš gautos lentelės skaičių daryti igūdžių palyginimą tarp specialistų būtų netikslinga, kadangi darbo skelbimų duomenų mokslininko pozicijai yra daugiau, negu duomenų analitiko. Taigi išrikiuokime mažėjimo tvarka turimus įgūdžius ir pažiūrėkime TOP5 kiekvienoje specialybėje atskirai ir palyginkime
# Analyst
skills_a=pd.DataFrame(specialists[specialists['specialist']=='Analyst'][['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].sum().sort_values(ascending=False))
skills_a.rename(columns={0:'count_a'}, inplace=True)
skills_a['count_a_proc'] = skills_a['count_a']/skills_a['count_a'].sum()*100
skills_a=skills_a.reset_index()
skills_a.rename(columns={'index':'skill'}, inplace=True)
skills_a
skill | count_a | count_a_proc | |
---|---|---|---|
0 | SQL | 1378 | 21.125249 |
1 | Excel | 1353 | 20.741990 |
2 | Statistics | 844 | 12.938832 |
3 | Python | 636 | 9.750115 |
4 | Tableau | 619 | 9.489499 |
5 | R | 400 | 6.132148 |
6 | SAS | 359 | 5.503603 |
7 | Machine Learning | 328 | 5.028361 |
8 | PowerBI | 248 | 3.801932 |
9 | Java | 204 | 3.127395 |
10 | SPSS | 154 | 2.360877 |
# Scientist
skills_s=pd.DataFrame(specialists[specialists['specialist']=='Scientist'][['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].sum().sort_values(ascending = False))
skills_s.rename(columns={0:'count_s'}, inplace=True)
skills_s['count_s_proc'] = skills_s['count_s']/skills_s['count_s'].sum()*100
skills_s=skills_s.reset_index()
skills_s.rename(columns={'index':'skill'}, inplace=True)
skills_s
skill | count_s | count_s_proc | |
---|---|---|---|
0 | Python | 2057 | 16.512804 |
1 | SQL | 2014 | 16.167617 |
2 | Excel | 1854 | 14.883198 |
3 | Statistics | 1794 | 14.401541 |
4 | Machine Learning | 1552 | 12.458858 |
5 | R | 851 | 6.831500 |
6 | Java | 792 | 6.357871 |
7 | Tableau | 673 | 5.402585 |
8 | SAS | 444 | 3.564261 |
9 | PowerBI | 286 | 2.295898 |
10 | SPSS | 140 | 1.123866 |
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
skills_a.plot.bar(x='skill', y='count_a', title='Data analyst skills', color='orange',ax=axs[0]);
skills_s.plot.bar(x='skill', y='count_s', title='Data scientist skills',ax=axs[1]);
# Pažiūrėkime kokiuose darbo sektoriuose reikalingi specialistai
specialists_sectors = specialists['Sector'].value_counts().dropna().sort_values(ascending=False).head(n=10)
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
specialists_sectors.plot.pie(autopct='%1.1f%%', ylabel="");
# Pažiūrėkime kokiuose darbo sektoriuose reikalingi specialistai atskirai
specialists_sectors_s = specialists[specialists['specialist'] == 'Scientist']['Sector'].value_counts().dropna().sort_values(ascending=False).head(n=10)
specialists_sectors_a = specialists[specialists['specialist'] == 'Analyst']['Sector'].value_counts().dropna().sort_values(ascending=False).head(n=10)
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
specialists_sectors_a.plot.pie(autopct='%1.1f%%', ylabel="", title='Data analyst/Sectors');
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
specialists_sectors_s.plot.pie(autopct='%1.1f%%', ylabel="", title='Data scientist/Sectors');
# Pažiūrėkime kokie atlyginimai svyruoja daugiausią darbo pasiūlymų turinčiuose sektoriuose
analyst = specialists[specialists['specialist'] == 'Analyst']
scientist = specialists[specialists['specialist'] == 'Scientist']
plt.figure(figsize = (8,4))
sns.boxplot(x = 'salary_avg', y = 'Sector', data = analyst, order = ['Information Technology', 'Business Services', 'Biotech & Pharmaceuticals','Finance', 'Health Care'], palette = 'GnBu_d')
plt.xlabel('Average Salary')
plt.ylabel('Sector')
plt.title('Data Analyst/Average Salary by Sector', fontsize = 12)
plt.figure(figsize = (8,4))
sns.boxplot(x = 'salary_avg', y = 'Sector', data = scientist, order = ['Information Technology', 'Business Services', 'Biotech & Pharmaceuticals','Finance', 'Health Care'], palette = 'GnBu_d')
plt.xlabel('Average Salary')
plt.ylabel('Sector')
plt.title('Data Scientist/Average Salary by Sector', fontsize = 12)
plt.show()
# Išrinkime TOP10 miestų ir TOP10 įmonių kuriuose yra siūlomos abi darbo pozicijos
specialists_locations = specialists['Location'].value_counts().sort_values(ascending=False).head(n=10)
specialists_companies = specialists['Company Name'].value_counts().sort_values(ascending=False).head(n=10)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
specialists_locations.plot.bar(ylabel="No. of jobs", xlabel="", color='orange', ax=axs[0], title='TOP10 Locations');
specialists_companies.plot.bar(ylabel="No. of jobs", xlabel="", ax=axs[1], title='TOP10 Companies');
# Išrinkime TOP10 miestų ir TOP10 įmonių / Data analyst
specialists_locations_a = specialists[specialists['specialist'] == 'Analyst']['Location'].value_counts().sort_values(ascending=False).head(n=10)
specialists_companies_a = specialists[specialists['specialist'] == 'Analyst']['Company Name'].value_counts().sort_values(ascending=False).head(n=10)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
specialists_locations_a.plot.bar(ylabel="No. of jobs", xlabel="", color='orange', ax=axs[0], title='TOP10 Data analyst/Locations');
specialists_companies_a.plot.bar(ylabel="No. of jobs", xlabel="", ax=axs[1], title='TOP10 Data analyst/Companies');
# Išrinkime TOP10 miestų ir TOP10 įmonių / Data scientist
specialists_locations_s = specialists[specialists['specialist'] == 'Scientist']['Location'].value_counts().sort_values(ascending=False).head(n=10)
specialists_companies_s = specialists[specialists['specialist'] == 'Scientist']['Company Name'].value_counts().sort_values(ascending=False).head(n=10)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
specialists_locations_s.plot.bar(ylabel="No. of jobs", xlabel="", color='orange', ax=axs[0], title='TOP10 Data scientist/Locations');
specialists_companies_s.plot.bar(ylabel="No. of jobs", xlabel="", ax=axs[1], title='TOP10 Data scientist/Companies');
# Kadangi 'Job Description' stulpelio eilutėse yra labai daug teksto, mes jį pašalinsime, prieš įkeliant į Tableau.
final=specialists.drop(columns=['Job Description'])
# final
import os
# os.getcwd()
os.makedirs('baigiamasis', exist_ok=True)
final.to_csv('baigiamasis/final_work.csv')
Pagal iškeltus darbe tikslus, galime padaryti sekančias išvadas:
1. Pagal Glassdoor darbo skelbimus duomenų mokslininkų vidutinis atlyginimas (107K) yra gerokai didesnis už duomenų analitiko (72K).
2. Imant TOP5 reikalaujamų kompiuterinių įgūdžių specialistams, pamatėme, kad abiems dominuoja praktiškai vienodi.
Duomenų mokslininkui - 'Python','SQL','Excel', 'Statistics', 'Machine Learning',
o duomenų analitikui - 'SQL', 'Excel', 'Statistics', 'Python','Tableau'.
3. Populiariausi darbo sektoriai Glassdoor puslapyje abiems specialistams yra 'Information Technology', 'Business services'ir 'Finance', o sektoriuje 'Biotech & Pharmaceuticals' jau yra reikalingi labiau duomenų mokslininkai.
Kas liečia atlyginimus pagal darbo sektorius, tai duomenų analitiko pozicijai geriasią atlygį siūlo 'Biotech & Pharmaceuticals' sektorius, o duomenų mokslininkui 'Information Technology'.
Didžiausia paklausa duomenų mokslininkų yra 'Apple', 'IBM' ir 'Amazon' įmonėse, tuo tarpu duomenų analitikų - 'Staffigo Technical Services' LLC, 'Diverse Lynx', Lorven Technologies Inc.